Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection
Abstract
There has been significant recent work on the theory and application of randomized coordinate descent algorithms, beginning with the work of Nesterov [SIAM J. Optim., 22(2), 2012], who showed that a random-coordinate selection rule achieves the same convergence rate as the Gauss-Southwell selection rule. This result suggests that we should never use the Gauss-Southwell rule, because it is typically much more expensive than random selection. However, the empirical behaviour of these algorithms contradicts this theoretical result: in applications where the computational costs of the selection rules are comparable, the Gauss-Southwell selection rule tends to perform substantially better than random coordinate selection. We give a simple analysis of the Gauss-Southwell rule showing that, except in extreme cases, its convergence rate is faster than choosing random coordinates. We also (i) show that exact coordinate optimization improves the convergence rate for certain sparse problems, (ii) propose a Gauss-Southwell-Lipschitz rule that gives an even faster convergence rate given knowledge of the Lipschitz constants of the partial derivatives, (iii) analyze the effect of approximate Gauss-Southwell rules, and (iv) analyze proximal-gradient variants of the Gauss-Southwell rule.

1 Coordinate Descent Methods

There has been substantial recent interest in applying coordinate descent methods to solve large-scale optimization problems, starting with the seminal work of Nesterov [2012], who gave the first global rate-of-convergence analysis for coordinate-descent methods for minimizing convex functions. This analysis suggests that choosing a random coordinate to update gives the same performance as choosing the “best” coordinate to update via the more expensive Gauss-Southwell (GS) rule. (Nesterov also proposed a more clever randomized scheme, which we consider later in this paper.)
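The two selection rules being compared are easy to state concretely. Below is a minimal sketch (not from the paper's experiments; the quadratic test function, the function name, and the constant 1/L_i step size are illustrative assumptions) contrasting random selection with the GS rule on a smooth strongly convex quadratic:

```python
import numpy as np

def coordinate_descent(A, b, steps=500, rule="gs", seed=0):
    """Minimize f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive
    definite, by single-coordinate updates with step size 1/L_i,
    where L_i = A[i, i] is the coordinate-wise Lipschitz constant."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(steps):
        grad = A @ x - b                 # full gradient (cheap here; for illustration only)
        if rule == "gs":
            # Gauss-Southwell: update the coordinate with the largest |partial derivative|
            i = int(np.argmax(np.abs(grad)))
        else:
            # randomized rule: pick a coordinate uniformly at random
            i = int(rng.integers(n))
        x[i] -= grad[i] / A[i, i]        # constant 1/L_i step along coordinate i
    return x
```

For a quadratic, the 1/L_i step is also the exact minimizer along coordinate i, so this sketch doubles as the "exact coordinate optimization" update; in practice the GS variant typically reaches a given suboptimality in fewer iterations, which is the empirical gap the paper sets out to explain.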
This result gives a compelling argument to use randomized coordinate descent in contexts where the GS rule is too expensive. It also suggests that there is no benefit to using the GS rule in contexts where it is relatively cheap. But in these contexts, the GS rule often substantially outperforms randomized coordinate selection in practice. This suggests that either the analysis of GS is not tight, or that there exists a class of functions for which the GS rule is as slow as randomized coordinate descent. After discussing contexts in which it makes sense to use coordinate descent and the GS rule, we answer this theoretical question by giving a tighter analysis of the GS rule (under strong-convexity and standard smoothness assumptions) that yields the same rate as the randomized method for a restricted class of functions, but is otherwise faster (and in some cases substantially faster). We further show that, compared to the usual constant step-size update of the coordinate, the GS method with exact coordinate optimization has a provably faster rate for problems satisfying a certain sparsity constraint (Section 5). We believe that this is the first result showing a theoretical benefit of exact coordinate optimization; all previous analyses show that these strategies obtain the same rate as constant step-size updates, even though exact optimization tends to be faster in practice. Furthermore, in Section 6, we propose a variant of the GS rule that, similar to Nesterov’s more clever randomized sampling scheme, uses knowledge of the Lipschitz constants of the coordinate-wise gradients to obtain a faster rate. We also analyze approximate GS rules (Section 7), which provide an intermediate strategy between randomized methods and the exact GS rule. Finally, we analyze proximal-gradient variants of the GS rule (Section 8) for optimizing problems that include a separable non-smooth term.
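The Lipschitz-aware variant of Section 6 changes only the selection step: rather than ranking coordinates by |∇_i f(x)| alone, it scales each by the corresponding coordinate-wise Lipschitz constant. A small sketch, assuming the selection criterion argmax_i |∇_i f(x)| / √L_i (the helper names are our own):

```python
import numpy as np

def gs_coordinate(grad):
    """Plain Gauss-Southwell rule: coordinate with the largest |partial derivative|."""
    return int(np.argmax(np.abs(grad)))

def gsl_coordinate(grad, L):
    """Gauss-Southwell-Lipschitz rule: weight each |partial derivative| by
    1/sqrt(L_i), where L_i is the Lipschitz constant of the i-th partial
    derivative. A large gradient on a coordinate with a large L_i only
    permits a small step, so this rule prefers the coordinate with the
    largest guaranteed decrease |grad_i|^2 / (2 * L_i)."""
    return int(np.argmax(np.abs(grad) / np.sqrt(L)))
```

For example, with grad = [3, -4] and L = [1, 16], the plain GS rule picks coordinate 1 (larger derivative), while the Lipschitz-weighted rule picks coordinate 0, whose cheaper curvature allows a larger step.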
2 Problems of Interest

The rates of Nesterov show that coordinate descent can be faster than gradient descent in cases where, if we are optimizing n variables, the cost of performing n coordinate updates is similar to the cost of performing one full gradient iteration. This essentially means that coordinate descent methods are useful for minimizing convex functions that can be expressed in one of the following two forms:
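The excerpt cuts off before listing the two forms, but the cost argument itself can be illustrated on least squares, f(x) = ½‖Ax − b‖², provided we maintain the residual r = Ax − b between updates. The function below is an illustrative sketch (our own, not from the paper), showing why n coordinate updates then cost roughly one full gradient evaluation:

```python
import numpy as np

def cd_least_squares(A, b, steps=1000, seed=0):
    """Random-coordinate descent on f(x) = 0.5 * ||Ax - b||^2, maintaining
    the residual r = Ax - b so that each coordinate update touches only one
    column of A. A single update then costs O(m) (or O(nnz) of that column
    if A is sparse), so n updates cost about as much as one full gradient
    A^T (Ax - b)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = -b.astype(float).copy()        # residual Ax - b at x = 0
    L = (A ** 2).sum(axis=0)           # coordinate-wise Lipschitz constants ||A[:, i]||^2
    for _ in range(steps):
        i = int(rng.integers(n))
        g_i = A[:, i] @ r              # partial derivative grad_i f(x), O(m) work
        delta = -g_i / L[i]            # exact minimization along coordinate i
        x[i] += delta
        r += delta * A[:, i]           # keep the residual consistent, O(m) work
    return x
```

Without the maintained residual, each coordinate update would need the full product Ax and coordinate descent would lose its cost advantage over gradient descent.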
Similar Articles
Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence
Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In th...
Supplementary materials for "Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environments"
f(α) ≡ g(Eα) + b^T α, where f(·) and g(·) are proper closed functions, E is a constant matrix, and L_i ∈ [−∞, ∞), U_i ∈ (−∞, ∞] are lower/upper bounds. It has been checked in [1] that ℓ1- and ℓ2-loss SVM are in the form of (I.1) and satisfy additional assumptions needed in [4]. We introduce an important class of gradient-based schemes for CD's variable selection: the Gauss-Southwell rule. It plays an importan...
Finding a Maximum Weight Sequence with Dependency Constraints
In this essay, we consider the following problem: We are given a graph and a weight associated with each vertex, and we want to choose a sequence of vertices that maximizes the sum of the weights, subject to some constraints arising from dependencies between vertices. We consider several versions of this problem with different constraints. These problems have applications in finding the converg...
Accelerating ISTA with an active set strategy
Starting from a practical implementation of Roth and Fisher’s algorithm to solve a Lasso-type problem, we propose and study the Active Set Iterative Shrinkage/Thresholding Algorithm (AS-ISTA). The convergence is proven by observing that the algorithm can be seen as a particular case of a coordinate gradient descent algorithm with a Gauss-Southwell-r rule. We provide experimental evidence that t...
Approximate Steepest Coordinate Descent
We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to n, the number of coordinates. In many practical applicatio...